ABSTRACT

Target  Tracking  and  3-D  Scene  Analysis  are  two  research  areas  in  Computer 
Vision  which  in  the  past  have  been  considered  separately.  However,  there  are 
many  advantages  in  combining  the  two  problems.  One  such  advantage  would  be 
the  ability  to  analyze  and  build  a  model  of  a  stationary  scene/environment 
through which dynamic objects move. This is possible through tracking the
moving objects and detecting instances of occlusion. This work is based on such
an idea and is concerned with the design of an Intelligent Target Tracking
System (ITTS) which combines the above two problems into one. In this paper we
present an experimental ITTS which generates a perspective map and a ground map
of a stationary environment.


The support of the Defense Advanced Research Projects Agency and the U.S. Army Night Vision
Laboratory under Contract DAAK70-33-K-0018 (DARPA Order 3506) is gratefully
acknowledged.


1-  INTRODUCTION 


Target  Tracking  and  Scene  Analysis  are  two  research  areas  in  Computer 
Vision which, in the past, have been dealt with as separate problems. Most target
tracking systems [3,5,9,10,18,23,26,27,30] have the relatively modest goal of
detecting and identifying moving objects and tracking them, using prediction and
correlation techniques, as long as they move unobscured across the field of view.
Most of these systems have been developed under a set of stringent real-time
constraints which, given current hardware technology, constrain them to employ
a set of computationally simple algorithms.

Most scene analysis/segmentation systems [11,17,19,25] analyze a given scene
based on the static properties of objects such as shape/structure and texture,
where the class of objects is generally predefined. These systems tend to be slow
whenever the scene is complex, containing many objects.

There  are  advantages  to  combining  the  two  problems  into  one.  This  paper  is 
based  on  such  an  idea  and  is  concerned  with  the  design  of  an  Intelligent  Target 
Tracking System (ITTS). This system tracks targets based not only on models for
targets (shape, motion, etc.) but also on models of the environment through which
the targets navigate and of the sensing system(s) employed to acquire the
time-varying images on which the analysis is based.

In this paper we utilize the dynamic occlusion of moving and stationary
objects by one another to place bounds on the distance and angular extent of the
location of the stationary objects. Occlusion of a moving object by a stationary
object yields an upper bound on the distance for that particular stationary
object,  while  occlusion  of  this  stationary  object  by  another  moving  object  yields  a 
lower  distance  bound  for  it.  We  detect  occlusion  using  object  models  and 
perspective  information.  Figure  1  illustrates  an  example  where  an  occluding 
object,  such  as  a  tree,  can  be  detected  by  tracking  a  moving  object  as  it  moves 
behind  the  tree.  During  the  course  of  tracking  the  moving  object,  we  are  able  to 
segment  the  stationary  scene  via  the  dynamic  occlusion.  This  allows  us  to 
construct  a  perspective  map  and  a  scaled  ground  map  of  the  environment. 
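
The bounding rule described above can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the class name `RangeBounds` and its method names are hypothetical, and distances are assumed to be plain numbers in a common unit.

```python
from dataclasses import dataclass
import math


@dataclass
class RangeBounds:
    """Distance interval (lower, upper) for one stationary object (hypothetical helper)."""
    lower: float = 0.0
    upper: float = math.inf

    def occludes_target(self, target_distance: float) -> None:
        # The stationary object hides a target at this distance, so it must be
        # nearer than the target: tighten the upper bound.
        self.upper = min(self.upper, target_distance)

    def occluded_by_target(self, target_distance: float) -> None:
        # A target at this distance hides the stationary object, so the object
        # must be farther away than the target: tighten the lower bound.
        self.lower = max(self.lower, target_distance)


tree = RangeBounds()
tree.occludes_target(40.0)     # a target 40 units away passed behind the tree
tree.occluded_by_target(25.0)  # a target 25 units away passed in front of it
```

Each new occlusion event can only narrow the interval, so the estimate improves monotonically as more targets are tracked.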

The ITTS is composed of a number of processes which can be divided into
three processing stages: target recognition, segmentation of time-varying images,
and scene model generation.

Target  recognition  is  an  interactive  process  (in  the  current  version  of  our 
ITTS)  where  the  distance  and  orientation  of  a  moving  target  with  respect  to  the 
viewer,  at  a  particular  instant,  is  computed  from  a  zoom  picture.  The 
segmentation  stage  is  composed  of  a  series  of  processes,  including  image 
differencing,  noise  cleaning,  connected  component  detection,  and  region  extraction 
from  a  wide  angle  picture. 

The  scene  model  generation  stage  is  the  main  part  of  the  system  and  is 
composed  of  a  variety  of  processes  : 

1- the “main program”, which controls the system,

2- a “matcher”, which matches the predicted appearance of moving
targets to their actual appearance in a given frame (instance),

3- a “predictor”, which, given two or more instances of a moving object’s
appearance, predicts its appearance in the next frame,

4- an “analyzer”, which corrects and updates all the data structures
based on the closeness of a match between predicted and actual
appearance of a moving target,

5- a “map generator”, which generates two kinds of maps: a perspective
map, which segments the scene perspectively into open spaces and
stationary portions (where there are stationary objects), and a scaled
ground map of the environment, which indicates where the moving
targets have been observed and where the stationary objects are located.


2-  AN  EXPERIMENTAL  SYSTEM 


In  this  section  we  present  an  experimental  ITTS.  The  principal  goal  is  to 
build  a  model  of  the  stationary  scene  by  tracking  the  known  moving  targets.  We 
accomplish this by detecting instances of occlusion, of which there are two kinds,
“occlusion  from  the  front”,  where  a  target  moves  behind  a  stationary  object,  and 
“occlusion  from  behind”,  when  the  target  moves  in  front  of  the  stationary  object. 

2.1- Problem Domain Definition

The  shapes  of  dynamic  objects  are  modeled  as  rectangular  parallelepipeds  of 
constant gray level (color); their images are modeled as rectangles. Three
parallelepipeds  of  different  sizes  were  used  to  represent  our  targets. 

A  target’s  mobility  was  represented  by  its  velocity.  An  object  at  any  instant 
of  time  was  represented  by  its  type,  size,  location,  orientation,  velocity, 
trajectory,  and  information  about  related  (occluding)  stationary  objects. 

Stationary  objects  were  modeled  as  tall,  elongated,  rectangular 
parallelepipeds  in  3-D,  and  as  rectangles  in  2-D.  Their  appearances  were 
represented by size and location. The 3-D location of a stationary object is
represented by a pair of intervals $[(\alpha_1, \alpha_2), (d_1, d_2)]$, where the true line of
sight to the stationary object is included in the interval $(\alpha_1, \alpha_2)$ and the true
range is included in $(d_1, d_2)$; see Figure 2.


We  adopt  a  two  camera  system  model.  One  has  a  wide  angle  lens  (10  mm), 
viewing  the  scene  from  a  fixed  perspective,  and  another  has  a  zoom  lens  (25  mm) 
capable  of  panning  around  the  Y  axis.  Figure  3  illustrates  the  camera  model. 
Images  taken  by  the  first  camera  are  used  by  the  segmentation  algorithm.  Images 
taken  by  the  zoom  lens  camera  are  used,  in  conjunction  with  object  models,  to 
determine  the  distance  and  orientation  of  moving  objects  in  a  particular  frame. 
The  motivation  for  using  a  zoom  lens  camera  is  that  the  high  angular  resolution 
allows  accurate  computation  of  the  distance  and  orientation.  It  is  assumed  that 
all  the  camera  parameters  are  known. 

2.2-  Data  Representation 

Our  system  includes  two  different  groups  of  data  structures,  one  symbolic 
and  the  other  iconic.  The  symbolic  data  structure  includes  a  pair  of  graphs,  one 
for  representing  the  dynamic  part  of  the  scene  and  another  for  representing  the 
stationary part of the scene; see Figure 4. The two are linked to each other at the
region and object level. The graph representing the dynamic objects in the scene
is interpreted as follows: Analysis of the observed data over a period of F frames
has resulted in the detection and tracking of N dynamic objects, where object “i”
has been seen for a sequence of $N_i$ frames, and in each frame the object has
appeared to consist of one or more dynamic regions.

The interpretation of the stationary component for the same period is as
follows: Tracking the N moving objects has resulted in recognition of a
stationary scene (environment) composed of M stationary objects, each consisting
of one or more stationary regions; see Figure 4.

It should be noted that the degree of the APRANC node (which specifies the
number of regions comprising the object at each frame) for any moving object in
the most recent frame (where the APRANC initially represents a prediction) can
be changed due to a detailed analysis of the appearance of that object by the
matching process. An APRANC node of degree two represents “middle
occlusion” explicitly. For example, the object in Figure 1.c is represented by
node “A” of Figure 4, indicating that object A in the first frame was
occluded in the middle, and therefore represented by two regions.

Occlusion,  in  general,  is  represented  in  the  symbolic  data  structure  by 
establishing  links  between  the  dynamic  and  stationary  region  and  object  nodes. 
For example, the two dynamic regions of Figure 1.c are represented by nodes
DYNREG#1 and DYNREG#2 in Figure 4, and the node APRANC#1 represents
the fact that this object is occluded in the middle. The stationary region between
the two dynamic regions of Figure 1.c is represented by the node STNREG#1.
This particular instance of occlusion is represented by the explicit links between
STNREG#1 and DYNREG#1 and between STNREG#1 and DYNREG#2.

The  second  symbolic  data  structure  is  a  two  dimensional  array  containing 
the  range  information  for  the  detected  stationary  objects.  Every  row  of  this  array 
represents  a  detected  stationary  object  with  enough  information  to  place  a  bound 
on the area in the scene where the stationary object resides. There are
bi-directional links between every row of this array and the appropriate DYNOBJ
and STNOBJ nodes of the graph pair (described earlier in this Section). This
array  is  used  and  updated  by  the  predictor  and  the  analyzer.  The  ground  map 
generator  also  uses  this  array.  The  third  symbolic  data  structure  is  also  a  two 
dimensional  array  containing  the  3-D  information  concerning  the  moving  targets, 
which  results  from  the  target  recognition  stage. 

The  iconic  data  structures  (maps)  are  the  other  class  of  data  structures  that 
are  employed.  Our  ITTS  incorporates  two  different  maps,  a  perspective  map  and 
a  ground  map.  The  perspective  map  represents  the  detected  stationary  regions 
and  in  turn  the  stationary  objects.  It  also  represents  the  detected  visible  portion 
(space)  of  the  environment,  indicating  the  paths  traveled  (in  perspective)  by  the 
moving  objects. 

The  visible  space  content  of  the  perspective  map  is  accumulated  by  the 
matching  process,  and  the  stationary  region  portion  is  accumulated  by  the 
predictor  and  the  error  analyzer.  Figure  14  shows  some  examples  of  the 
perspective  map.  The  light  regions  represent  the  visible  portions  of  the  scene, 
while  the  darker  regions  represent  the  stationary  portions  of  the  scene  (the 
stationary  objects). 

The  ground  map  represents  a  scaled  map  of  the  scene,  where  each  target  at 
any  given  frame  (time  instant)  is  represented  by  its  location,  trajectory,  and 
velocity.  The  accumulation  of  these  for  any  target  represents  the  path  along 
which  the  target  has  been  moving.  The  ground  map  also  represents  the  areas  of 
the  ground  where  the  stationary  objects  reside.  This  map  is  constructed  by  the 
“map  generator”.  Figure  15  gives  some  examples  of  the  ground  map.  The  map 
represents a 70” by 70” scene and each side of a square in the grid represents
approximately 2.5 inches of the real world. The bright vectors represent
instances of the moving objects and their velocity, while the darker vectors (in
some  cases  making  up  polygons)  represent  regions  of  the  ground  where  stationary 
objects  have  been  found  to  reside.  The  bottom  of  Figure  15  indicates  the 
location  of  the  camera  (origin  of  the  world  and  camera  coordinate  system).  The 
two  dark  vectors  pointing  towards  the  upper  sides  of  the  image  represent  the 
angular  field  of  view  of  the  wide  angle  lens. 

2.3-  Process  Definition 

This  Section  describes  the  processing  components  of  the  ITTS  individually. 
Figure 5 illustrates these processes schematically.


2.3.1-  Target  Detector 

Input  to  this  process  is  a  set  of  gray  level  images  and  a  mask  frame  (an 
image  of  the  stationary  environment  only),  all  taken  by  the  wide  angle  lens 
camera.  It  processes  one  image  at  a  time  and  produces  a  set  of  region 
descriptors (in our case parameters describing a rectangle) representing the
dynamic  regions.  The  target  detector  consists  of  the  following  processes: 

1- take the difference between an image and the mask frame,

2- apply thresholding to the difference picture,

3- apply shrink and expand operations (noise cleaning) to the image
obtained in step 2,

4- apply connected components analysis to the image obtained from step 3,

5- fit upright bounding rectangles to components remaining after step 4,

6- apply local image differencing to a window surrounding each detected
region to refine the estimates of the shapes of the dynamic regions (in
this step the thresholding process, after differencing, uses a threshold
value much lower than in step 2).
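
The core of this pipeline (differencing, thresholding, connected components, and bounding rectangles) can be sketched as follows. This is a simplified illustration under stated assumptions: images are lists of rows of gray levels, components are 4-connected, and the noise cleaning and local refinement steps are omitted. The function name `detect_regions` is hypothetical.

```python
from collections import deque


def detect_regions(image, mask, threshold):
    """Return bounding rectangles (c, r, nc, nr) of regions that differ from the mask frame."""
    rows, cols = len(image), len(image[0])
    # Difference against the mask frame, then threshold to a binary picture.
    binary = [[abs(image[r][c] - mask[r][c]) > threshold for c in range(cols)]
              for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    rects = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not binary[r0][c0] or seen[r0][c0]:
                continue
            # Grow one 4-connected component by breadth-first search,
            # tracking its extreme rows and columns as we go.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            rmin = rmax = r0
            cmin = cmax = c0
            while queue:
                r, c = queue.popleft()
                rmin, rmax = min(rmin, r), max(rmax, r)
                cmin, cmax = min(cmin, c), max(cmax, c)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and binary[rr][cc] and not seen[rr][cc]):
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            # Upright bounding rectangle of the component.
            rects.append((cmin, rmin, cmax - cmin + 1, rmax - rmin + 1))
    return rects
```

Rectangles are returned in the order their first pixel is met in a row-by-row scan; a shrink/expand pass before the component search would suppress isolated noise pixels.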


Figure  6  shows  an  example  of  the  target  detector’s  output. 


2.3.2-  Target  Recognition 

In order to generate a model of the stationary scene and approximate areas
of the scene where stationary objects reside based on occlusion by/of moving
objects,  we  need  to  estimate  the  location  and  orientation  of  these  moving  objects 
in  the  scene.  This  is  done  based  on  the  known  physical  dimensions  and  measured 
image  heights  of  the  moving  objects.  To  better  resolve  the  height  of  a  target  in 
the  image  we  use  pictures  of  the  targets  taken  by  a  zoom  lens  camera. 
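
As an illustration of how a known physical height and a measured image height determine distance, the sketch below uses the textbook pinhole-camera relation d = f H / h. This is an assumption made for illustration and is not necessarily the exact formula used by the system; the function name and the sample values are hypothetical.

```python
def distance_from_height(focal_length, true_height, image_height):
    """Pinhole estimate of range: d = f * H / h.

    focal_length and image_height must share one unit (e.g. mm on the image
    plane); the result is in the unit of true_height.
    """
    if image_height <= 0:
        raise ValueError("image_height must be positive")
    return focal_length * true_height / image_height


# Hypothetical numbers: 25 mm lens, target 4 inches tall, imaged 2 mm tall.
d = distance_from_height(25.0, 4.0, 2.0)
```

Because the estimate varies inversely with the measured image height, a longer focal length (larger image of the target) reduces the relative error of the measurement, which is the motivation given for the zoom lens.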

Given any of these pictures, the user interactively segments a face of the
target (preferably the one in the center of the image) by positioning four points,
two on the bottom edge of the face and two on the upper edge. This will fit a
polygon to this particular face as shown in Figure 7. Next we describe the actual
estimation of distance.


Figure  8  shows  a  target  in  the  world  coordinate  system  and  its  relationship 
with  the  image  coordinate  system.  The  vertices  A,B,C,  and  D  map  into  the  points 
a,b,c,  and  d  on  the  image  plane,  respectively,  and  from  the  relation  between  these 
vertices  and  points  we  can  easily  compute  the  distance  and  orientation.  It  is  easy 
to  show  that 
where $d$ is the desired distance, H is the height of the target and A; is

To compute the orientation, p, of the planar face, we average the orientation
of the top and the bottom edge of the face. The orientation, p, is given by

2.3.3- Scene-Model Generation


The components of the scene-model generation process are described in the
following subsections.

2.3.3.1- Matcher


This  process  compares  the  predicted  appearance  of  dynamic  objects  for 
frame  n  with  the  actual  observed  data  in  the  n-th  frame.  Since  both  the 
predicted  and  actual  appearance  of  the  dynamic  objects  are  represented  as 
regions  (in  the  image  plane)  which  in  turn  are  represented  by  bounding 
rectangles,  the  matching  is  done  among  these  rectangles.  These  rectangles  are 
represented  by  the  four  parameters  <c,r,nc,nr>  where  c  =  column,  r  =  row,  nc 
=  number  of  columns,  nr  =  number  of  rows.  The  first  two  indicate  the  location 
and  the  other  two  the  size  of  a  rectangle. 

The matching process measures the similarity between a predicted
appearance $R_p$ and a region $R_d$ in the set of dynamic regions in the current
frame. The following steps are taken during the matching of an $R_p$ with an $R_d$:

1- Find all $R_d$'s from the set of all $R_d$'s such that

$\mathit{cdiff} = |c_p - c_d| < \mathit{tvalue}_1$, $\mathit{rdiff} = |r_p - r_d| < \mathit{tvalue}_2$,

$\mathit{ncdiff} = |nc_p - nc_d| < \mathit{tvalue}_3$, $\mathit{nrdiff} = |nr_p - nr_d| < \mathit{tvalue}_4$.

All the “tvalues” are predefined threshold values.

2- Choose the $R_d$ such that cdiff, rdiff, ncdiff, and nrdiff are
minimum for it and $TR_p = TR_d$, where
$TR_p$ and $TR_d$ are two trajectories computed as follows:

$TR_p = \mathrm{DIRECTOR}(R_l, R_p)$;

$TR_d = \mathrm{DIRECTOR}(R_l, R_d)$;

where $R_l$ is the last appearance of the object represented
by $R_p$.

Note: DIRECTOR returns one of the following directions: N, S, E,
W, NE, NW, SE, SW, based on the comparisons between c, r, nc,
and nr.
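
The two matching steps can be sketched as below. Several details are assumptions made for illustration: `director` here compares only rectangle centers (the text says the original also considers nc and nr), rows are assumed to increase toward the top of the picture, and "minimum" is interpreted as the smallest sum of the four differences. The names `Rect`, `director`, and `match` are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Rect(NamedTuple):
    c: int   # column of the rectangle's reference corner
    r: int   # row of the rectangle's reference corner
    nc: int  # number of columns
    nr: int  # number of rows


def director(a: Rect, b: Rect) -> str:
    """Coarse compass direction of motion from a to b ('' if the center did not move)."""
    dc = (b.c + b.nc / 2) - (a.c + a.nc / 2)
    dr = (b.r + b.nr / 2) - (a.r + a.nr / 2)
    ns = "N" if dr > 0 else "S" if dr < 0 else ""
    ew = "E" if dc > 0 else "W" if dc < 0 else ""
    return ns + ew


def match(pred: Rect, last: Rect, candidates: Sequence[Rect],
          tvalues: Sequence[int]) -> Optional[Rect]:
    """Return the observed region that best matches the prediction, or None."""
    best, best_cost = None, None
    for d in candidates:
        diffs = (abs(pred.c - d.c), abs(pred.r - d.r),
                 abs(pred.nc - d.nc), abs(pred.nr - d.nr))
        # Step 1: every difference must be below its threshold.
        if not all(x < t for x, t in zip(diffs, tvalues)):
            continue
        # Step 2: predicted and observed trajectories must agree.
        if director(last, pred) != director(last, d):
            continue
        cost = sum(diffs)  # assumed interpretation of "minimum"
        if best_cost is None or cost < best_cost:
            best, best_cost = d, cost
    return best
```

Requiring exactly equal directions is brittle under pixel noise; a practical version would likely add a small dead band around zero displacement before assigning a compass direction.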
The above steps describe the matching process between any $R_p$ and $R_d$. It
should be realized, however, that the process generally matches n $R_p$'s with
m $R_d$'s (or vice versa). Depending on the relation between n and m, different
cases may arise. Some of these cases are as follows:

1- Success: This is the case where n = m and every $R_p$ is matched to
one and only one $R_d$.

2- An $R_p$ has no corresponding $R_d$: This is typically due to an incorrect
predicted decomposition; the system had predicted two regions
($R_p^1$ and $R_p^2$) representing an object, but only one $R_d$ is matched to
either $R_p^1$ or $R_p^2$ or both, resulting in an ambiguity indicating an
incorrect prediction.

Another possible cause of this case is that the object is currently totally
occluded, but the system had predicted that it would see some of it. In
our set of image frames this case did not occur.

3- An $R_d$ has no corresponding $R_p$: There are two possibilities. Either a
new object has entered the scene, or an object has been decomposed
unexpectedly (middle occlusion).


a) An object is classified as new if any of the following is
true:

1- $c_d \le C + \epsilon$ (target entering from LEFT)
2- $r_d \le R + \epsilon$ (target entering from BOTTOM)
3- $c_d + nc_d \ge NC - \epsilon$ (target entering from RIGHT)
4- $r_d + nr_d \ge NR - \epsilon$ (target entering from TOP)

where the picture frame is <C,R,NC,NR> and $\epsilon$ is some pixel
count threshold (in our case $\epsilon = 50$).

If this is the case, the count for the dynamic objects is increased
by one, and the corresponding DYNOBJ, FRAME, APRANC, and
DYNREG nodes of Figure 4 are generated. This new rectangle
$R_d$ is painted as visible space in the iconic data structure.

b) If $R_d$ does not represent a new incoming object, then
it must represent part of an existing moving object. We assume
that to whichever object, X, this $R_d$ belongs, the other part of
X has already been matched to some other region, say $R_{d'}$.
The following steps will find an $R_{d'}$ which in combination with
$R_d$ represents object X:

for h = 1, N do
    Rem: N = # of regions extracted from the current frame.
    Find an $R_{d'}$ which has already been matched to an $R_p$.
    if $c_{d'} - (c_d + nc_d) < \mathit{avgtreewidth}$ and $|r_{d'} - r_d| < \epsilon$
    or $c_d - (c_{d'} + nc_{d'}) < \mathit{avgtreewidth}$ and $|r_{d'} - r_d| < \epsilon$
    then set $R_d$ to point to whichever object the $R_{d'}$ points to.
    endif
endfor

After finding $R_{d'}$, update the symbolic and iconic data
structures to indicate this decomposition of object X. More
specifically, the degree of the APRANC node is changed from 1
to 2.


4- An $R_d$ has multiple corresponding $R_p$'s: This is a special instance of
the second case.

5- An $R_p$ has multiple corresponding $R_d$'s: This is a special instance of
the third case.
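
The border test used in case 3 to decide whether an unmatched region is a newly entering target can be sketched as follows. The function name `entry_side` is hypothetical, the conditions are checked in the order LEFT, BOTTOM, RIGHT, TOP, and the frame origin (C, R) is assumed to be (0, 0) so that the right and top limits reduce to NC and NR.

```python
def entry_side(rect, frame, eps=50):
    """Return the side from which a region appears to enter, or None.

    rect  = (c, r, nc, nr) of the unmatched observed region
    frame = (C, R, NC, NR) of the picture, assumed to start at (0, 0)
    eps   = pixel-count threshold (50 is the value quoted in the text)
    """
    c, r, nc, nr = rect
    C, R, NC, NR = frame
    if c <= C + eps:
        return "LEFT"
    if r <= R + eps:
        return "BOTTOM"
    if c + nc >= NC - eps:
        return "RIGHT"
    if r + nr >= NR - eps:
        return "TOP"
    return None  # not near any border: treat as part of an existing object
```

A region for which `entry_side` returns `None` falls under the decomposition branch, where it is attached to an already matched neighbouring region.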


2.3.3.2- Predictor


This  module  predicts  the  future  appearance  of  a  dynamic  object  (in  3-D  and 
then  in  2-D)  based  on  its  type  and  : 

1-  its  current  3-D  status  (location,  orientation,  velocity,  and  trajectory), 

2-  its  previous  3-D  status,  and, 

3- the accumulated information about the stationary world.

The  location  of  a  target  in  the  real  world  is  the  location  of  a  reference  point 
on  that  target  in  the  real  world.  This  reference  point  is  chosen  to  be  the 
intersection  of  the  optical  axis  of  the  zoom  lens  camera  system  and  the  center  of 
a  particular  face  of  the  target.  We  modeled  the  targets  so  that  the  origin  would 
lie  in  the  center  of  either  a  Right  or  Front  face  of  the  target. 

If $P_t(X_t, Z_t)$ denotes the location of a target in the scene at time t, where

$$X_t = d_t \sin\beta_t \quad\text{and}\quad Z_t = d_t \cos\beta_t, \qquad (4)$$

$d_t$: the distance to the target computed by the procedure of Section 2.3.2,
$\beta_t$: the pan angle at which this particular zoom picture was taken,

then $P_{t+1}$ will represent the location of the target at time t+1, and is computed
by

$$P_{t+1} = P_t + A\,\Delta t + B\,\Delta t^2, \qquad (5)$$

where the second and third terms on the right hand side represent the velocity and
acceleration/deceleration rate of that target, respectively. Writing (5) in
component form yields

$$X_{t+1} = X_t + A\,\Delta t + B\,\Delta t^2, \qquad (6)$$

$$Z_{t+1} = Z_t + A\,\Delta t + B\,\Delta t^2. \qquad (7)$$

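In order to use (6) and (7) we need to compute the coefficients A and B. Fitting
(6) to the three most recent observations, it is easy to show that

$$A = \frac{X_{t-2} - 4X_{t-1} + 3X_t}{2\,\Delta t}, \qquad (8)$$

$$B = \frac{X_{t-2} - 2X_{t-1} + X_t}{2\,\Delta t^2}, \qquad (9)$$

and, for the Z component, that

$$A = \frac{Z_{t-2} - 4Z_{t-1} + 3Z_t}{2\,\Delta t}, \qquad (10)$$

$$B = \frac{Z_{t-2} - 2Z_{t-1} + Z_t}{2\,\Delta t^2}. \qquad (11)$$

Substituting (8) and (9) into (6), and (10) and (11) into (7) gives

$$X_{t+1} = X_t + \frac{X_{t-2} - 4X_{t-1} + 3X_t}{2} + \frac{X_{t-2} - 2X_{t-1} + X_t}{2}, \qquad (12)$$

$$Z_{t+1} = Z_t + \frac{Z_{t-2} - 4Z_{t-1} + 3Z_t}{2} + \frac{Z_{t-2} - 2Z_{t-1} + Z_t}{2}, \qquad (13)$$

and simple algebra yields

$$X_{t+1} = 3X_t - 3X_{t-1} + X_{t-2}, \qquad (14)$$

$$Z_{t+1} = 3Z_t - 3Z_{t-1} + Z_{t-2}. \qquad (15)$$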

It should be realized that (14) and (15) are true for t > 2; for the case where
t = 2 the following simpler equations are used:

$$X_{t+1} = 2X_t - X_{t-1}, \qquad (16)$$

$$Z_{t+1} = 2Z_t - Z_{t-1}. \qquad (17)$$
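
Both prediction rules amount to polynomial extrapolation of uniformly sampled positions: a three-point (constant-acceleration) rule when three observations are available and a two-point (constant-velocity) rule otherwise. A minimal sketch, with the hypothetical function name `extrapolate`, applied to one coordinate at a time:

```python
def extrapolate(history):
    """Predict the next sample of a uniformly sampled coordinate.

    history holds past values, oldest first. With three or more samples the
    constant-acceleration rule 3*x_t - 3*x_{t-1} + x_{t-2} is used; with
    exactly two samples the constant-velocity rule 2*x_t - x_{t-1} is used.
    """
    if len(history) >= 3:
        x_tm2, x_tm1, x_t = history[-3:]
        return 3 * x_t - 3 * x_tm1 + x_tm2
    if len(history) == 2:
        x_tm1, x_t = history
        return 2 * x_t - x_tm1
    raise ValueError("need at least two observations")


# A target whose X coordinate follows t**2 (constant acceleration):
next_x = extrapolate([0, 1, 4])
# A target seen only twice, moving at constant velocity:
next_z = extrapolate([2, 5])
```

The same function can be reused for the orientation angle, since its update follows the same three-point and two-point forms.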

Equations (14) and (15) or (16) and (17) give the translational components of (5);
next we compute the rotational component. Let $\alpha_t$ represent the orientation of a
target at time t with respect to the X-axis of the real world coordinate system.
Associated with (5) is a rotational component computed by

$$\alpha_{t+1} = \alpha_t + A\,\Delta t + B\,\Delta t^2, \qquad (18)$$

where the second and third terms of the right hand side represent change in
orientation and rate of change in orientation, respectively. Solving for A and B (as
we did for the case of X and Z) and substituting them in (18) we get

$$\alpha_{t+1} = 3\alpha_t - 3\alpha_{t-1} + \alpha_{t-2} \quad \text{for } t > 2, \qquad (19)$$

$$\alpha_{t+1} = 2\alpha_t - \alpha_{t-1} \quad \text{for } t = 2, \qquad (20)$$

which completes the transformation a target will go through from time t to
t+1. In order to complete the 3-D prediction process, all the vertices of a target
(its model) should go through the computed transformations.

Let $P_w(X_{w_{t+1}}, Z_{w_{t+1}})$ represent a point (preferably a corner) of a target in
the real world at time t+1, and $P_m(X_m, Z_m)$ represent the corresponding point
on the target’s model. Then the 3-D prediction for a target is computed for every
corner of the model according to


This completes the first step of the prediction process. Next we compute the
perspective projection of the points computed by (21). This is accomplished using
the following equations:

$$x_{t+1} = \ldots \qquad (22)$$

$$y_{t+1} = \ldots \qquad (23)$$

where $p_{t+1}(x_{t+1}, y_{t+1})$ is a point on the image corresponding to $P_w(X_{w_{t+1}}, Z_{w_{t+1}})$,
L is the camera height, and f is the focal length of the wide angle lens.

The above procedure will give us eight points (the eight corners of a
rectangular parallelepiped), to which we next fit the best rectangle, $R^p_{t+1}$, which
represents a region on the image at time t+1 for which we predict the presence of
a particular target.
Having predicted $R^p_{t+1}$, we check for possible occlusion; if one
exists, $R^p_{t+1}$ is modified appropriately. To check for possible occlusion, we need to
compute the angle along which the target is predicted to be seen. This angle is
simply computed using (14) and (15) or (16) and (17) as follows:

$$\beta_{t+1} = \arctan\left(\frac{X_{t+1}}{Z_{t+1}}\right). \qquad (24)$$

Next we compare $\beta_{t+1}$ with the entries in the symbolic data structure
representing the range information regarding the detected stationary objects (see
Section 2.2). If a match exists, then the target may be occluded depending on the
accumulated distance information for that particular stationary object and the
predicted distance for this target. If distance comparison yields an occlusion, then
$R^p_{t+1}$ is modified based on the region $R_s$ representing the stationary object.

This modification is done in one of the following two ways: either the size of
$R^p_{t+1}$ gets smaller (the case when the target will be occluded on one side), or $R^p_{t+1}$
splits into two pieces $R^1_{t+1}$ and $R^2_{t+1}$ (the case where the target will be occluded
somewhere in the middle). See Figure 9.
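
The angle and distance comparison can be sketched as below. The three-way outcome ("behind", "front", "uncertain") is an assumption made for illustration, since the text only says that occlusion depends on the accumulated distance information; the function name `classify_occlusion` and the tuple layout of a range entry are likewise hypothetical.

```python
import math


def classify_occlusion(x, z, entry):
    """Compare a predicted target position with one stationary-object entry.

    (x, z) is the predicted ground position of the target.
    entry = (alpha1, alpha2, d1, d2): angular interval (radians) and distance
    interval of the stationary object.
    Returns None if the line of sight misses the object, otherwise
    "behind", "front", or "uncertain".
    """
    alpha1, alpha2, d1, d2 = entry
    beta = math.atan2(x, z)   # predicted viewing angle, measured from the Z axis
    dist = math.hypot(x, z)   # predicted distance from the camera
    if not (alpha1 <= beta <= alpha2):
        return None
    if dist > d2:
        return "behind"       # farther than the upper bound: target is occluded
    if dist < d1:
        return "front"        # nearer than the lower bound: target occludes the object
    return "uncertain"        # distance falls inside the object's range interval


entry = (0.1, 0.3, 20.0, 30.0)
result = classify_occlusion(40 * math.sin(0.2), 40 * math.cos(0.2), entry)
```

`math.atan2` is used instead of a plain arctangent of the ratio so that a target with Z close to zero does not cause a division error.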

The above modifications would hold if the stationary object was known to
have been completely detected. Otherwise, $R^p_{t+1}$ and $R_s$ are modified on the
assumption that this stationary object is as wide as possible. For example, if a
stationary object (represented by $R_s$) is partly detected and its observed width is
w, then it is possible that we have not yet detected the remaining part of the
object, $r_s$, with a width equal to W-w (if W is the maximal width of any
stationary object). Therefore, $R^p_{t+1}$ would be modified as if it was occluded by a
region of width W. See Figure 10.

The  last  step  in  the  prediction  process  is  the  updating  of  the  graph  pair  data 
structures,  range  information  data  structure,  and  the  perspective  and  ground 
maps.  The  graph  pair  is  updated  by  generating  DYNREG  node(s),  corresponding 
STNREG node (if any $r_s$ is predicted), and FRAMES node and labeling all these
nodes  as  P  (Prediction).  These  predictions  will  be  checked  against  the  actual  data 
by  the  analyzer  at  time  t+1. 

2.3.3.3- Analyzer

The  analyzer  examines  the  reports  from  the  matching  process.  If  the  errors 
(mismatches)  are  below  some  predefined  threshold,  then  the  predictions  are 
accepted  and  all  data  structures  are  updated  appropriately.  Otherwise  a  search  is 
initiated  to  determine  the  cause  of  the  failure. 

The search can be best described by giving an example. Figure 11.a shows a
predicted target’s image X with an associated known part of a stationary object
A. Figure 11.b shows the actual appearance of the object. By comparing these
two (a and b) and considering the values of different parameters, different
conclusions can be made. For example, if Pi ** P2 and 11 * 12, then the
hypothesis that the stationary object extends (perspectively) by width A
from the right is wrong, and the actual width of the stationary object is B. Or,
if $X_1 \ne X_2$, then we conclude that the previous estimate of this target’s velocity is
incorrect and should be corrected by the value $X_2 - X_1$.

The analyzer is also responsible for the detection of unpredicted stationary
regions (and in turn stationary objects). This is done as follows: Suppose a
predicted region $R_p$ is matched to a region $R_d$ of the current frame with an
above-threshold error. Furthermore, assume that one of the two ends of $R_p$ and $R_d$
match (i.e. $c_p = c_d$ or $c_p + nc_p = c_d + nc_d$) and the length of $R_p$ is larger than
the length of $R_d$ (i.e. $nc_p > nc_d$). Then it is hypothesized that there exists a piece
of a stationary region (representing part of a stationary object) with width
$|nc_p - nc_d|$ located to the left or right of the target, depending on the trajectory of
the moving object.
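
A sketch of this hypothesis test follows. One simplification is assumed: the side of the hidden strip is inferred from which end of the two rectangles coincides, rather than from the target's trajectory as the text describes. The function name `hidden_strip` is hypothetical, and exact equality of the ends is used as stated (a tolerance would be more practical on real data).

```python
def hidden_strip(pred, obs):
    """Hypothesize an undetected stationary strip from a width mismatch.

    pred, obs = (c, r, nc, nr) for the predicted and the observed region.
    Returns (side, width) of the hypothesized stationary strip, or None
    when the rule does not apply.
    """
    pc, _, pnc, _ = pred
    oc, _, onc, _ = obs
    if pnc <= onc:
        return None              # the observation is not shorter than predicted
    width = pnc - onc
    if pc == oc:
        return ("right", width)  # left ends coincide: the right part is hidden
    if pc + pnc == oc + onc:
        return ("left", width)   # right ends coincide: the left part is hidden
    return None
```

For example, a prediction 40 columns wide matched to an observation 30 columns wide with the same left column yields a 10-column strip on the right.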


2.3.3.4-  Map  Generator 


The map generator is responsible for constructing/updating the perspective
and the ground map. Both of these maps are 510 by 510 8-bit/pixel pictures. The
perspective map is constructed and updated by painting regions (rectangles)
corresponding to the dynamic regions ($R_d$'s) and stationary regions ($R_s$'s) whose
parameters are taken from the symbolic graph pair data structure (see Section
2.2). Figure 14 contains some examples of the perspective map.

The map generator is also responsible for the construction and maintenance
of the ground map. For every instance of a moving target a vector is drawn
representing its location (in 3-D), orientation, and velocity. All of the areas on
the ground (in 3-D) where stationary objects have been detected to reside are also
indicated by drawing vectors bounding them. Figure 15 contains examples of the
ground map.

2.3.3.5-  Occlusion  From  Behind  Detector 

This procedure detects all those stationary objects which are occluded by the
moving targets, by taking the difference between every dynamic region R_d
corresponding to DOBJ_it and the same region in the mask frame, resulting in a
difference region R_diff.

The region R_diff is then thresholded based on the predefined gray level
values corresponding to the dynamic and stationary objects, yielding the
thresholded region R_thresh. If, in fact, there exists a stationary object behind
this dynamic object DOBJ_it, then the region R_thresh would contain non-zero
elements, to which we then fit a rectangle R_r. R_r is then entered into the
symbolic data structures and the iconic data structures. Examples of this
detection are illustrated in Figures 14.a, 14.c, 15.c, and 15.e.
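
The difference, threshold, and rectangle-fitting steps can be sketched as follows. This is only one plausible reading of the procedure: the function name, the gray levels `dyn_gray` and `stat_gray`, the tolerance `tol`, and the specific thresholding rule (keep pixels whose difference is close to the dynamic level minus the stationary level) are assumptions, since the text does not give the actual values or rule.

```python
def occlusion_from_behind(frame, mask, rect, dyn_gray=200, stat_gray=40, tol=20):
    """Fit a rectangle to the stationary object hidden behind a dynamic region.

    frame, mask: 2-D lists of gray levels (current frame and mask frame).
    rect: (row, col, nrows, ncols) of the dynamic region in the current frame.
    Returns (row, col, nrows, ncols) of the fitted rectangle, or None.
    """
    row, col, nrows, ncols = rect
    # Assumed rule: a bright target over a dark stationary object gives
    # a difference near dyn_gray - stat_gray.
    expected = dyn_gray - stat_gray
    hits = [
        (r, c)
        for r in range(row, row + nrows)
        for c in range(col, col + ncols)
        if abs((frame[r][c] - mask[r][c]) - expected) <= tol
    ]
    if not hits:
        return None  # the thresholded region has no non-zero elements
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    # Bounding rectangle of all surviving pixels.
    return (min(rows), min(cols),
            max(rows) - min(rows) + 1, max(cols) - min(cols) + 1)
```

A bounding rectangle is the simplest fit consistent with the rectangular regions used elsewhere in the system; a real implementation might reject isolated noise pixels before fitting.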


3-  EXPERIMENTAL  RESULTS 


This  section  includes  results  of  processing  a  sequence  of  twenty  picture 
frames  by  our  ITTS.  The  images  are  510  by  510,  eight  bit  pixels.  Dark  objects 
represent  the  stationary  objects  and  bright  objects  represent  the  moving  targets. 
There  were  three  different  sizes  of  moving  targets,  and  a  variety  of  stationary 
objects. 

The  scene  covers  an  area  of  approximately  70”  by  70”.  Figure  12  shows  the 
sequence  of  twenty  frames  plus  the  mask  frame  where,  in  order  to  save  space,  the 
510  by  510  images  have  been  clipped  into  510  by  175  segments  (areas  of  interest). 
Figure  13  illustrates  the  scene  layout  with  the  paths  along  which  targets  move. 
Figure  14  shows  some  snapshots  of  the  perspective  map  where,  again  to  save 
space,  the  images  are  clipped  into  510  by  175  pieces.  The  bright  regions  represent 
the  visible  portions  of  the  scene  (where  the  targets  have  been  seen)  and  the  dark 
regions  represent  the  stationary  objects.  The  perspective  map  changes  as  more 
picture  frames  are  processed. 

It  should  be  noted  that  the  stationary  objects  are  not  represented  by  their 
full  height;  due  to  the  altitude  of  the  camera  with  respect  to  the  ground  plane, 
the tops of stationary objects cannot interact with dynamic objects. More
accurate estimates of the heights could be obtained if the cameras were looking
down at the scene, so that targets at different distances pass behind and are
occluded by the stationary objects.

Figure 14.a shows the perspective map after processing one frame. The
stationary region detected in this frame is due to the procedure for detecting
occlusion from behind (see Section 2.3.3.5 for details); this is also indicated in the
ground map, as shown in Figure 15.a. Other instances of this kind of occlusion
detection are shown in Figures 14.c and 15.c where, in each, the rightmost
stationary region is detected by the above procedure.

Figure 12 shows not only the sequence of the 20 frames but also the results
of the predictor: each frame is shown with the prediction for that particular
frame overlaid on it. For instance, Figure 12.a
shows  the  third  frame;  the  rectangles  represent  predictions  for  this  frame, 
obtained  in  the  second  frame.  Since  the  predictor  in  the  second  frame  knew  about 
the  existence  of  the  leftmost  stationary  object  in  the  scene,  which  would  be 
occluding  the  topmost  moving  target  at  the  third  frame,  two  regions  were 
predicted  to  represent  this  particular  moving  target. 

Another example of such double region prediction is shown in Figure 12.g for
the bottom target. It is evident that this particular prediction is wrong. As can
be seen in Figure 15, representing the ground map (explained fully later in this
Section), after processing the tenth frame there is only a maximal distance
associated with the second stationary object from the left. At this time, all the
ITTS knows about this particular stationary object is that it is located
somewhere between the viewer and its maximal distance point. Now, since the
predicted location of the moving target is within this interval, the predictor,
after considering the possibilities of occlusion, predicts two regions. However, at
the twelfth frame the analyzer detects this mistake and corrects all data
structures.

Figure 15 shows examples of the ground map (as described in Section 2.2)
generated by the map generator of Section 2.3.5. Figure 15.a shows the ground
map after processing the first frame and Figure 15.j the result after twenty
frames, where the difference between them is apparent. It should be noted that
one of the goals for our ITTS was producing a map which would resemble and
approximate the map shown in Figure 13. This task was accomplished through
the generation of the ground map shown in Figure 15.j.

The analyzer is one of the processes responsible for producing the
information used in generating the ground map. We now give some examples of
how the analyzer works. The first example concerns the case of the predictor's
mistake, which was discussed earlier in this Section. At the eleventh frame, the
analyzer detects this mistake by realizing that 1) the two predicted R_p's match
only one R_d, 2) the R_d bounds the two R_p's, and 3) the predictor's decision
based on the maximal distance for the corresponding stationary object was
incorrect. This leads the analyzer to realize that the moving target should be in
front of this particular stationary object, yielding a bound on the minimal
distance to the stationary object. This is reflected in shortening the vector
representing the interval in which the stationary object is located, as shown in
Figure 15.f. A similar analysis is done at the sixteenth frame, for the moving
object at the bottom.

The analyzer is also responsible for tightening the intervals and regions
representing the location of the stationary objects; this is illustrated in Figures
15.a through 15.j.
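
The interval tightening described in this Section can be summarized in a small sketch. The class and method names are hypothetical, and distances are treated as plain numbers measured from the viewer. The sketch assumes that a target seen occluded at distance d bounds the stationary object's distance from above, that a target seen in front at distance d bounds it from below, and that the predictor treats any target distance strictly inside the interval as a possible occlusion.

```python
import math
from dataclasses import dataclass


@dataclass
class DistanceInterval:
    """Interval of distances in which a stationary object may be located."""
    min_dist: float = 0.0        # the viewer's position
    max_dist: float = math.inf   # nothing known yet

    def target_occluded_at(self, d: float) -> None:
        # The object hid a target at distance d, so it is nearer than d.
        self.max_dist = min(self.max_dist, d)

    def target_in_front_at(self, d: float) -> None:
        # A target at distance d was seen unoccluded, so the object is farther.
        self.min_dist = max(self.min_dist, d)

    def classify(self, d: float) -> str:
        """Relation of a target at distance d to the stationary object."""
        if d <= self.min_dist:
            return "in_front"
        if d >= self.max_dist:
            return "behind"
        return "uncertain"  # occlusion is possible but not certain
```

In the example discussed above, only a maximal distance is known after the tenth frame, so a target predicted inside the interval is classified as uncertain and two regions are predicted; once the analyzer finds the target in front, the minimal distance is raised and the interval shrinks.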

Figure 1.

a) No interaction; b) target occluded on the right side; c) middle occlusion.
Panels a, b, and c show instances of a target in 3-D and its corresponding
regions. One can hypothesize (at or after ...) that there exists an occluding
object somewhere between the viewer and the target.


an angular interval (a1, a2). b) Detection of a
minimal  distance  for  the  same  object  “x”.  (a) 
and  (b)  together  approximate  the  location  of 


The  camera  model. 


Figure 5.

The processing components of our
experimental ITTS.


Figure 6.

An example of target detection; preprocessing
pictures taken by the wide angle lens camera.


An  example  of  range  detection;  a  polygon  is 
fitted  to  a  face  of  a  moving  target  in  a  picture 
taken  by  the  zoom  lens  camera. 

Figure 12.


The  sequence  of  20  frames  (our  input  data), 
where  the  black  and  white  windows  represent 
the  predictions  for  each  moving  target,  a) 
shows  the  mask  frame,  b)  through  u)  show  the 
sequence  of  20  frames.  It  should  be  noted  that 
at  the  9th  and  12th  frames,  the  leftmost 
moving  objects  are  just  entering  the  scene. 


Figure 13.

This  figure  represents  the  ground  map  where 
all  the  points  are  measured.  “X”  represents 
location  of  each  moving  target  with  the 
particular  frame  number  above  it.  The  black 
circles represent the location of the stationary
objects. 